Quantitative Trait Loci (QTL) Analysis of Seed Protein and Oil Content in Wild Soybean (Glycine soja)

Soybean seeds consist of approximately 40% protein and 20% oil, making them one of the world’s most important cultivated legumes. However, the levels of these compounds are negatively correlated with each other and regulated by quantitative trait loci (QTL) that are controlled by several genes. In this study, a total of 190 F2 and 90 BC1F2 plants derived from a cross of Daepung (Glycine max) with GWS-1887 (G. soja, a source of high protein), were used for the QTL analysis of protein and oil content. In the F2:3 populations, the average protein and oil content was 45.52% and 11.59%, respectively. A QTL associated with protein levels was detected at Gm20_29512680 on chr. 20 with a likelihood of odds (LOD) of 9.57 and an R2 of 17.2%. A QTL associated with oil levels was also detected at Gm15_3621773 on chr. 15 (LOD: 5.80; R2: 12.2%). In the BC1F2:3 populations, the average protein and oil content was 44.25% and 12.14%, respectively. A QTL associated with both protein and oil content was detected at Gm20_27578013 on chr. 20 (LOD: 3.77 and 3.06; R2 15.8% and 10.7%, respectively). The crossover to the protein content of BC1F3:4 population was identified by SNP marker Gm20_32603292. Based on these results, two genes, Glyma.20g088000 (S-adenosyl-l-methionine-dependent methyltransferases) and Glyma.20g088400 (oxidoreductase, 2-oxoglutarate-Fe(II) oxygenase family protein), in which the amino acid sequence had changed and a stop codon was generated due to an InDel in the exon region, were identified.


Introduction
Soybean (Glycine max L.) is an important crop worldwide that represents a major source of protein and vegetable oil for the human diet and animal feed. As of 2021, soybean was the largest source of protein meal in the world, at 243.6 t, and the second-largest source of vegetable oil (58.7 t) after palm oil (http://soystats.com/, accessed on 12 February 2023). In Asian countries, soybean seeds are used to produce a number of food products, including soymilk, tofu, soybean paste, natto, and soy sauce. In the West, soybean is typically used for soybean meal and seed oil. Soybean seeds generally consist of 40% protein and 20% oil [1], and these traits are negatively correlated with each other [2,3]. For this reason, it is very difficult to improve both traits simultaneously. In addition, because there is a negative correlation between seed yield and protein content [4], high-protein varieties need to be developed with care. In addition to being easily influenced by environmental factors, the protein and oil content of soybean seeds is regulated by the polygenes and quantitatively inherited [5,6]. These polygenes can be divided into major genes, which are less influenced by the environment and that have a significant influence on these levels; and minor genes, which have a weaker influence. 2 of 14 Wild soybean (G. soja Sieb. and Zucc.), the ancestor of cultivated soybean, has high genetic diversity and is thus valuable as a breeding material for soybean breeding programs [7,8]. Various studies have used wild soybean to improve biological stress resistance, abiotic stress tolerance, nutrition, and yields [9]. The average protein content of wild soybean is reported to be higher than that of cultivated soybean, although this may be due to correlations with the yield or oil content [9]. In the study by Chen and Nelson (2004), the protein content of wild and cultivated soybean lines was about 47% and 40%, respectively, while the oil content was 15% and 11%, respectively [10].
After the publication of the soybean genome for the first time by Schmutz et al. (2010) [11], Ha et al. (2012) [12] advanced genomic research further with the integration of physical maps for G. max and G. soja. QTL mapping uses F 2 , backcross (BC), and recombinant inbred line (RIL) populations derived from bi-parental crosses. In many soybean populations, the QTLs for proteins and oils have been mapped to genomic regions on chromosomes 15 and 20 [13][14][15][16][17]. Major QTLs for protein and oil content were identified by Diers et al. (1992) using RFLP markers for the F2 population through the crossbreeding of G. max (A81-356022) and G. soja (PI 468916) [18]. The QTL located on chromosome 15 has been fine-mapped at an interval of 535 kb between simple sequence repeat (SSR) markers (Kim et al. 2016) [19], while the candidate gene for the QTL located on chromosome 20 has been cloned as Glyma.20G85100 encoding the CCT domain [20].
To date, 255 and 322 protein and oil content-related QTLs, respectively, have been identified using bi-parental populations (https://www.soybase.org/, accessed on 12 February 2023). However, these QTLs may include multiple duplicate detections, so the Soybean Genetics Committee has emphasized the importance of experimentally confirming QTLs and adding "cq" in front of the original QTL name to indicate that it has been confirmed [20,21]. In total, 16 QTLs each have been confirmed for protein and oil content (https://www.soybase.org/), and these are distributed across 11 chromosomes, including chromosomes 15 and 20 [18][19][20]. Of these, the only QTLs derived from wild soybean are cqpro-003 and cqoil-004 [22]. Therefore, the purpose of this study was to discover new genes for protein and oil content from wild soybean using two progeny types derived from high-protein wild soybean lines.

Phenotypic Variation in the Protein and Oil Content
In the present study, the seed oil and protein content were measured using seeds harvested from F 3 and BC 1 F 3 progeny lines in 2019 and 2020, respectively. The protein content of Daepung and GWS-1887, the parents of the F 2:3 line, in 2019 was 37.10% and 50.37%, respectively, while the oil content was 19.81% and 5.83%, respectively. In the 190 F 2:3 plants, the protein content ranged from 40.08% to 50.96% with a mean of 45.52%, and the oil content ranged from 7.84% to 16.61% with a mean of 11.59% (Table 1). The protein content of Daepung and GWS-1887, the parents of the BC 1 F 2:3 line, in 2020 was 40.05% and 49.28%, respectively, and the oil content was 16.53% and 5.34%, respectively. In the BC 1 F 2 population, the protein content ranged from 31.50% to 49.54% with a mean of 44.25% and the oil content ranged from 7.73% to 14.84% with a mean of 12.14% (Table 2). The phenotypic variation of the F 2 and BC 1 F 2 populations followed a normal distribution (Figures 1 and 2, respectively).
The protein content of Daepung and GWS-1887, the parents of the BC1F2:3 line, in 2020 was 40.05% and 49.28%, respectively, and the oil content was 16.53% and 5.34%, respectively. In the BC1F2 population, the protein content ranged from 31.50% to 49.54% with a mean of 44.25% and the oil content ranged from 7.73% to 14.84% with a mean of 12.14% (Table 2). The phenotypic variation of the F2 and BC1F2 populations followed a normal distribution (Figures 1 and 2, respectively).

Linkage Maps and QTL Analysis
Linkage maps for the F2 and BC1F2 populations were constructed using polymorphic SNP markers acquired from SoySNP6K Illumina BeadChips (Figures S1 and S2, respectively). In the F2 population, 2592 polymorphism markers were used, with an average chromosome length of 95 cM and an average of 130 markers located across each of the 20 chromosomes (Table 3). The protein content of Daepung and GWS-1887, the parents of the BC1F2:3 line, in 2020 was 40.05% and 49.28%, respectively, and the oil content was 16.53% and 5.34%, respectively. In the BC1F2 population, the protein content ranged from 31.50% to 49.54% with a mean of 44.25% and the oil content ranged from 7.73% to 14.84% with a mean of 12.14% (Table 2). The phenotypic variation of the F2 and BC1F2 populations followed a normal distribution (Figures 1 and 2, respectively).

Linkage Maps and QTL Analysis
Linkage maps for the F2 and BC1F2 populations were constructed using polymorphic SNP markers acquired from SoySNP6K Illumina BeadChips (Figures S1 and S2, respectively). In the F2 population, 2592 polymorphism markers were used, with an average chromosome length of 95 cM and an average of 130 markers located across each of the 20 chromosomes (Table 3).

Linkage Maps and QTL Analysis
Linkage maps for the F 2 and BC 1 F 2 populations were constructed using polymorphic SNP markers acquired from SoySNP6K Illumina BeadChips (Figures S1 and S2, respectively). In the F 2 population, 2592 polymorphism markers were used, with an average chromosome length of 95 cM and an average of 130 markers located across each of the 20 chromosomes (Table 3). In the BC 1 F 2 population, 1063 polymorphism markers were used, with an average chromosome length of 60 cM (except chromosome 12, which did not show polymorphism) and an average of 130 markers located on each of the 19 chromosomes (Table 4). The average genetic interval for both the F 2 and BC 1 F 2 populations was 1.1 cM. QTLs for the protein and oil content in both populations were identified using MQM mapping analysis. In the F 2:3 population, QTLs for protein and oil content were identified on chromosomes 20 and 15, respectively (Figure 3), and these QTLs accounted for 17.2% and 12.2% of the phenotypic variation, respectively, with additive effects of the alleles on these traits of −1.10 and 0.59 (Table 5).   On the other hand, in the BC1F2:3 population, QTLs for protein and oil content were both identified on chromosome 20 ( Figure 4), accounting for 15.8% and 10.7% of the phenotypic variation, respectively, with additive effects of the alleles on these traits of −1.49 and 0.62 (Table 6).  On the other hand, in the BC 1 F 2:3 population, QTLs for protein and oil content were both identified on chromosome 20 (Figure 4), accounting for 15.8% and 10.7% of the phenotypic variation, respectively, with additive effects of the alleles on these traits of −1.49 and 0.62 (Table 6).  On the other hand, in the BC1F2:3 population, QTLs for protein and oil content were both identified on chromosome 20 (Figure 4), accounting for 15.8% and 10.7% of the phenotypic variation, respectively, with additive effects of the alleles on these traits of −1.49 and 0.62 (Table 6).

Crossover Detection
From the BC 1 F 3 population, two lines with a heterozygote at a position expected to represent a high-protein-promoting gene on chromosome 20 were selected and advanced to produce a BC 1 F 3:4 generation, followed by genotyping and protein content measurement. In Table 7, crossover occurred at 33,049,242 bp, and the individual with the genotype of the parent Daepung had average protein levels of 44.10 g, while individuals with the genotype of the parent GWS-1887 had a protein level of 47.58 g. In addition, crossover occurred at 32,603,292 bp, and individuals with the genotype of the parent Daepung had average protein levels of 45.25 g, and the individual with the genotype of the parent GWS-1887 had a protein level of 48.54 g. Given the genotypes of the two lines, it was predicted that the range of genes related to high protein levels is present at least downstream of position 32,603,292 bp.

Genome Re-Sequencing
Genome-resequencing analysis was conducted on wild soybean GWS-1887, which has a high protein content. The total number of sequencing reads was about 260 million with a sequencing depth of 38.6× and a total size of about 39 billion bp, while the coverage for the reference genome was 95.3%. Compared with the Williams 82 reference genome, approximately 4.7 million SNPs and 0.9 million InDels were identified in GWS-1887 (Table 8). A total of 21 protein-related candidate genes with InDels were detected between Gm20_27578013, which is a molecular marker identified as a result of QTL mapping in the BC 1 F 2:3 population, and Gm20_32603292, which was identified as the crossover site in BC 1 F 3:4 (Table 9).   Of these, large InDels were identified in five genes (Glyma.20G085300, Glyma.20G085450, Glyma.20G085800, Glyma.20g087000, and Glyma.20G088000), small InDels were most common in 40 loci in Glyma.20G085300, and three large InDels were present in Glyma.20G088000. In particular, InDels occurred in the exon region of genes Glyma.20g088000 and Glyma.20g088400, and the InDels of the two genes generated stop codons with amino acid frameshifts ( Figure 5). Of these, the stop codon in Glyma.20g088000 is expected to greatly simplify the structure of the protein (Figure 6). QTL analysis of the protein and oil content in soybean has been well-studied in previous research. The present study searched for QTLs related to high protein content using two F2 and BC1F2 populations derived from a cross between cultivated soybean variety Daepung and wild soybean variety GWS-1887. The protein content of cultivated soybean is known to be about 40% [1], whereas wild soybean GWS-1887 has protein levels close to Of these, the stop codon in Glyma.20g088000 is expected to greatly simplify the structure of the protein (Figure 6). Of these, the stop codon in Glyma.20g088000 is expected to greatly simplify the structure of the protein (Figure 6). QTL analysis of the protein and oil content in soybean has been well-studied in previous research. The present study searched for QTLs related to high protein content using two F2 and BC1F2 populations derived from a cross between cultivated soybean variety Daepung and wild soybean variety GWS-1887. The protein content of cultivated soybean is known to be about 40% [1], whereas wild soybean GWS-1887 has protein levels close to 50% [9,10,23]. This suggests that wild soybean GWAS-1887 may be useful for QTL analysis

Discussion
QTL analysis of the protein and oil content in soybean has been well-studied in previous research. The present study searched for QTLs related to high protein content using two F 2 and BC 1 F 2 populations derived from a cross between cultivated soybean variety Daepung and wild soybean variety GWS-1887. The protein content of cultivated soybean is known to be about 40% [1], whereas wild soybean GWS-1887 has protein levels close to 50% [9,10,23]. This suggests that wild soybean GWAS-1887 may be useful for QTL analysis in terms of mapping the genetic regions associated with high protein content in soybean. However, the crossbreeding between G. max and G. soja may lead to linkage drag and consequent negative introgression such as reduced yields [9]. Daepung exhibited a higher annual variation in its protein content (37.10% in 2019 and 40.05% in 2020) than did GWS-1887 (50.37% in 2019 and 49.28% in 2020) (Tables 1 and 2). Several studies have reported that a lack of soil moisture reduces the protein content of soybean [24][25][26]. The rainfall in the experimental area in August 2019 during the soybean development period was lower than normal, while the rainfall in August 2020 was above average (http://www.kma.go.kr/, accessed on 12 February 2023). Therefore, it appears that the protein content of wild soybean has a higher environmental stability than cultivated soybean.
Previous QTL analysis of the protein and oil content in soybean seeds has been conducted with various populations, with the identified QTLs distributed across 20 chromosomes (https://www.soybase.org/). Of these, major QTLs for protein and oil content are present on chromosomes 15 and 20 [18], and several researchers have attempted to narrow down their precise location [13,[19][20][21][22]27]. In the present study, the QTLs related to protein and oil content in the F 2:3 population were identified as Gm20_29512680 and Gm15_3621773, respectively, whereas in the BC 1 F 2:3 population, marker Gm20_27578013 was identified for both protein and oil. Kim et al. (2016) reported the fine-mapping of a backcross line of Williams 82 and PI 407788A with 96 BARCSOYSSR markers and found that the QTL related to the protein and oil content on chromosome 15 was present in a 535 kb region from the physical position 3.59 Mbp to 4.12 Mbp [19]. These results are consistent with the SNP marker Gm15_2621773 at the physical position 3.63 Mbp detected for oil content in the F 2:3 population in the present study. Recently, cqSeed protein-003 located on chromosome 20 was narrowed down through fine-mapping to a 77.8 kb region between genetic marker BARCSOYSSR_20_0670 and BARCSOYSSR_20_0674 (31.74 to 31.82 Mbp), and the Glyma.20G85100 gene encoding the CCT domain was identified as a candidate gene involved in protein content [20].
In our results, protein-related QTLs were mapped to Gm20_29512680 at 30.  [20]. The reason for these inconsistencies could the low LD with the surrounding markers [20], and it is known that the wild soybean variety used as a parent in this study has a lower LD than cultivated soybean [28]. For more accurate confirmation of the location, crossovers around the QTL detected in BC 1 F 2:3 were identified but could not be found, and it was concluded that a crossover occurred at markers Gm20_33049242 and Gm20_32603292 in the two BC 1 F 3:4 lines, which was advanced one generation by selecting high-protein lines. Therefore, it was predicted that the protein-related gene is present in the region downstream of Gm20_32603292.
Based on the results of QTL mapping, InDels were then searched for in the candidate genes located at around 30 Mbp between the Williams 82 reference genome and GWS-1887. Interestingly, no mutation was detected in the Glyma.20G85100 gene of the CCT motif family protein, which was recently cloned as a major protein-related gene [20]. These results suggest that other major protein-related genes may be present in a similar genetic region. Wang et al. (2021) selected protein-related candidate gene Glyma.20g088000 using DEG analysis via RNA-seq, and it was found that Glyma.20g088000 had a significant difference in its sequence between the high-protein Nanxiadou 25 and low-protein Tongdou 11 varieties due to InDels [16]. Interestingly, Glyma.20g088000 (S-adenosyl-L-methionine-dependent methyltransferase) was also selected as a candidate gene in the present study because small and large InDels occurred within several regions of the gene. In addition, Glyma.20g086900 (aldehyde dehydrogenase-related) and Glyma.20g088400 (oxidoreductase, 2-oxoglutarate-Fe(II) oxygenase family protein) genes were selected by Lee et al. (2019) as a result of a GWAS for the soybean seed protein content from maturation groups I to IV [17]. These two genes were also identified as candidate genes in this study.
In the present study, it was confirmed that Glyma.20g088000 and Glyma.20g088400 had a large InDel in the 5 first exon and a small InDel in the 3 third exon, respectively ( Figure 5). Nonsense mutations that create stop codons and frameshifts in which amino acids are rearranged can disrupt the function of a gene [29]. In one example, truncated polypeptides generated as a result of nonsense mutations resulted in the loss of anthocyanin pigments associated with the color of soybean flowers [30]. In particular, the stop codon in Glyma.20g088000 is expected to greatly simplify the structure of the protein (Figure 6), thus it is likely to have a significant effect on the expression of its function. Although these candidate genes have potential functions in metabolism, the mechanisms of how they relate to seed composition require further study. In addition, the results collectively suggest that protein content may be regulated by the complex interaction of multiple genes located at around 30 Mbp on chromosome 20.

Plant Materials
In the present study, 180 F 3 and 90 BC 1 F 4 populations derived from a cross between Daepung and GWS-1887 were analyzed. Daepung, which was used as the female, recurrent parent, is an elite Korean variety that is strongly resistant to disease and shattering and has high yields [31], while GWS-1887, which was used as the male parent, was selected from the core collection of wild soybean accessions from the Rural Development Administration (RDA) because it has a protein content of 50% or higher. In the summer of 2018, F 1 seeds were obtained from artificial crossbreeding in an experimental field at Chonnam University (Gwangju, 36 • 17 N, 126 • 39 E, Republic of Korea). The F 1 seeds were planted in a greenhouse during the 2018-2019 winter season to obtain F 2 seeds, with the generation then advanced from F 2 to F 3 in the summer of 2019. At the same time, F 1 seeds were backcrossed in the summer of 2019 to obtain BC 1 F 1 seeds. The produced BC 1 F 1 seeds were planted in a greenhouse during the 2019-2020 winter period to obtain BC 1 F 2 seeds. Finally, in the summer of 2020, the BC 1 F 2 generation was advanced and BC 1 F 3 seeds were obtained.

Analysis of Protein and Oil Content
All harvested seed samples were dried at 40 • C for 7 d and then pulverized using a coffee grinder to produce 3 g each for subsequent analysis. The crude protein content was measured using the Kjeldahl method. Reagents required for digestion, distillation, and titration were prepared, including 0.1N hydrochloric acid, a decomposition accelerator (containing 10 g of potassium sulfate and 1 g of copper sulfate), 40% sodium hydroxide solution, and 1% boric acid solution with 100 mL and 70 mL of Bromocresol green and methyl red solutions, respectively. The sample solution was prepared by mixing 0.7-1.0 g of the ground seed and 7-8 g of the decomposition accelerator with 10 mL of sulfuric acid in a decomposition bottle. Digestion was carried out by heating the sample solution at a slow ramping rate until no visible bubbles remained and the solution became transparent. The solution was then analyzed using a Kjeltec 1030 Autoanalyzer (FOSS Tecator AB, Hogans, Sweden) following the manufacturer's instructions.
The crude oil content was measured using ether extraction. For this, an oil metering bottle was pre-dried at around 95-100 • C for 2 h followed by cooling in a desiccator for 30 min. Following this, 2-3 g of the sample wrapped in No. 2 filter paper was dried at the same temperature and for the same duration of time as the oil metering bottle. After drying, the sample was placed in a Soxtec 1043 instrument (FOSS Tecator AB, Hogans, Sweden), and subjected to a flow of ether at 80 • C for 8 h to extract the oil. The processed ether was then collected in an oil metering bottle and subsequently dried (95-100 • C for 3 h) followed by cooling in a desiccator (40 min) and weighing. The oil content was determined by subtracting the weight of the empty oil metering bottle from the weight of the bottle containing the extract.
Crude oil(%) = Oil bottle weight after extraction − Oil bottle weight before extraction Sample weight × 100

DNA Extraction and SNP Genotyping
Fresh leaf tissue was collected at the beginning of growth for DNA extraction and ground using liquid nitrogen in a mortar. Genomic DNA was isolated from 20 mg of lyophilized leaf tissue using a DNeasy Plant Mini Kit (QIAGEN, Valencia, CA, USA) according to the manufacturer's protocol. The quality and quantity of the extracted total DNA were verified using a Nano-MD UV-Vis spectrophotometer (Scinco, Seoul, Republic of Korea). The extracted DNA was stored in a freezer at −80 • C until further use.

Genetic Linkage Analysis
A genetic linkage map was constructed using the Kosambi mapping function in Joinmap v4.1 (Kyazma, Wageningen, The Netherlands). For genetic analysis, MQM mapping was employed with MapQTL 6.0 (Kyazma, Wageningen, The Netherlands). In the F2 population, permutations were conducted to determine the genome-wide significance threshold for the LOD scores, with the number of permutations set at 1000. In the BC1F2 population, an LOD score of ≥3.0 was set as the threshold for determining the presence of a QTL. LOD graphs and the location maps for the QTLs were created with MapChart2.2.

Re-Sequencing
Re-sequencing analysis was commissioned by Insilicogen (Yongin, Republic of Korea) and performed using an Illumine Novaseq 6000 platform. A library was constructed from DNA fragments with 151 bp paired ends read using a DNA Sample Prep Kit (Illumina) following the manufacturer's instructions. An analysis pipeline for detecting mutations in the sequencing data for the entire genome was employed with an NF-Core/SAREK workflow [32]. The snpEff tool was used for genetic variation annotation and effect prediction, while the snpEff database was referenced to Glycine max var. Williams 82 [11]. The whole genome sequencing data of GWS-1887 were deposited in NCBI under the BioProject PRJNA915129.

Conclusions
In this study, QTL mapping of the protein and oil content in soybean seeds was conducted using two progeny populations derived from high-protein wild soybean lines. QTL was detected in the region of cqPRO-003, which has been previously reported as a major QTL related to protein content, but as a result of resequencing, no difference was observed from the recently cloned candidate gene cqPRO-003. On the other hand, new candidate genes Glyma.20g088000 and Glyma.20g088400, which contained InDels, were discovered. This suggests that the protein content may be regulated by the complex interaction of multiple genes and associations other than those that have previously been reported.